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ABSTRACT 

Our aim is to present a fast and general Bayesian inference framework based on the synergy between 
machine learning techniques and standard sampling methods and apply it to infer the physical prop- 
erties of clumpy dusty torus using infrared photometric high spatial resolution observations of active 
galactic nuclei. We make use of the Metropolis-Hastings Markov Chain Monte Carlo algorithm for 
sampling the posterior distribution function. Such distribution results from combining all a-priori 
knowledge about the parameters of the model and the information introduced by the observations. 
The main difficulty resides in the fact that the model used to explain the observations is computation- 
ally demanding and the sampling is very time consuming. For this reason, we apply a set of artificial 
neural networks that are used to approximate and interpolate a database of models. As a consequence, 
models not present in the original database can be computed ensuring continuity. We focus on the 
application of this solution scheme to the recently developed public database of clumpy dusty torus 
models. The machine learning scheme used in this paper allows us to generate any model from the 
database using only a factor 1CP 4 of the original size of the database and a factor 10~ 3 in computing 
time. The posterior distribution obtained for each model parameter allows us to investigate how the 
observations constrain the parameters and which ones remain partially or completely undetermined, 
providing statistically relevant confidence intervals. As an example, the application to the nuclear 
region of Centaurus A shows that the optical depth of the clouds, the total number of clouds and the 
radial extent of the cloud distribution zone are well constrained using only 6 filters. The code is freely 
available from the authors. 

Subject headings: methods: statistical, data analysis — galaxies: Seyfert, nuclei — infrared: galaxies 
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1. INTRODUCTION 

It is customary that physical information about astro- 
physical objects cannot be obtained directly from the 
observables. In such a case, astrophysicists propose a 
plausible scenario described by a physical model and the 
procedure is to compare the observables with the predic- 
tions of the model with the aim of inferring the physi- 
cal parameters of the model. The presence of degenera- 
cies (either induced by the presence of noise or intrinsic 
to the model) introduce complexity in the analysis and 
they need to be taken into account. This is the subject 
of Bayesian data analysis that, although it is rooted on 
ideas developed in the 19th century, it has become prac- 
tical only in the last decades. 

The fundamental idea behind Bayesian data analysis 
is to take into account that all parameters of a model 
can be understood as random variables with associated 
probability distribution functions. The standard prob- 
lem of model fitting is usually seen as finding the set 
of model parameters that better reproduce the observ- 
ables. However, the Bayesian approach is far more in- 
formative and transforms the problem into finding the 
probability distribution function associated with the pa- 
rameters of the model once the data set is taken into 
account. In the presence of noise and/or degenera- 
cies, these probability distribution functions represent 
the complete solution to the problem and automatically 
include all the statistical information about the param- 
eters that can be inferred from the observables. We 
have witnessed an enormous interest in Bayesian infer- 
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ence in Astrophysics in the last decade. The reason for 
this resides in two facts. First, the quality and amount 
of observed data is usually deficient and one has to 
rely on methods that exploit to the limit the reduced 
amount of information. Second, the applied physical 
models are sometimes too complex as compared with the 
data available to constrain them. To mention a few re- 
cent works, we find applications in cosmological anal- 
yses ( e.g.. Eewis fc Bridle! l2002t iRubino-Martin et al.1 
120031 iRebolo et a l.l 12004 iTrottal 12008ft gravitational 
wave analyses (e.g., [Cornish & Crowder 2005h. grav- 
itational lensing (e.g., IBrewer fc Lewis! l2006f). oscilla- 
tion of solar- like stars (e.g.. IBrewer et al.ll2007l). anal- 



ysis o f spectropolarimetric data (|Asensio Ramos et all 
l2007al). analysis of extrem e ultraviolet spectral line fluxes 
(|Kashvap & Drake! Il99cl . and more. 

As we review in Appendix \^ Bayesian inference tech- 
niques can be essentially reduced to the ca l culation 
of multi-dimension al integrals ( e.g., iNeall Il993t iSkillind 
2004: lGregorvl2005t lT7ottal[200l . In very simple models, 
these integrals can be carried out analytically. However, 
more realistic problems cannot be analytically treated. 
The explosion of Bayesian analysis methods in the last 
decades has to be associated with the set of efficient sam- 
pling techniques today kn own as Markov Chain Monte 
Carlo methods (MC MC; Metropolis et~aI1 fl953L iNeail 
Il993t iGregorvl l2005f h In spite of their success, these 
methods also present the drawback of being computa- 
tionally intensive because the proposed model has to be 
evaluated many times. As a consequence, the execu- 
tion time of these techniques is quite high if the eval- 
uation time of the model is non-negligible. For this rea- 
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son, there have been some efforts in recent years towards 
reducing the evaluation time of the models at the ex- 
pense of a small lost in accuracy. They are based on 
the development of approximate methods that are able 
to "learn" a database of models for many combinations 
of the model parameters. For instance. iFendt fe Wandeltl 
(|2007f ) developed a method based on polynomial interpo- 
lation for the rap id cosmolog i cal p arameter estimation 
problems. Later, lAuld et "al . (2008) used a neural net- 
work approach for the calculation of cosmic microwave 
background power spectra (only for models in a small hy- 
percube around the commonly accepted region of most 
probable values for the cosmological parameters) leading 
to a very fast Bayesian cosmological parameter estima- 
tion code. 

Our main aim in this paper is to present 
BayesCLUMPY, a computer program that allows us 
to efficiently carry out Bayesian analysis of observed 
spectral energy distributions coming from the inner 
region of active galactic nuclei (AGN). To this aim, we 
use the recentl y develo ped clumpy dusty torus model of 
iNenkova et~aH (|2008aHbl ). known as CLUMPY models, 
and develop a MCMC code whose output is the proba- 
bility distribution function for all the parameters of the 
CLUMPY models once the observations are taken into 
account. As a consequence, the code yields statistically 
significant estimations of the parameters and, more 
important, statistically relevant confidence intervals. 
This facilitates the investigation of degeneracies and can 
be also used to suggest future observations that can help 
us introduce stronger constraints in the inference. The 
code is based on a recently released on-line database of 
CLUMPY models 1 . W e apply an i n terpo lation method 
like that presented by lAuld et all (|2008ft that greatly 
accelerates the evaluation of models. In our case, 
we manage to make the approximation method work 
correctly for the whole database and not only for a small 
hypercube, thus allowing us to efficiently explore the 
full space of parameters. Finally, the Bayesian character 
of the approach allows the user to include any a-priori 
knowledge about any parameter. 

2. CLUMPY MODELS 

According to the Unified Model for S eyfert galaxies 
(|Antonuccil 11991 lUrrv fe Padovanil 1 1995ft . Type-2 AGN 
are the edge-on counterparts of the face-on Type-1 AGN. 
This way, in Type-1 AGN the broad-line region (BLR), 
that is surrounded b y a dusty torus of a few parsecs 
([Tristram et al.ll2007fj , is observed directly, together with 
the narrow-line region (NLR) emission, whereas in the 
case of Type-2 AGN, only the NLR emission is seen di- 
rectly. However, the Unification Model is not universally 
applicable, since there are several galaxies that do not 
reveal the broad lines in polarized light. 

Regardless, it is clear that there is dust surrounding 
the central region of AGN distributed in a toroidal shape. 
The dust grains in the torus absorb the ultraviolet pho- 
tons from the central engine and, after reprocessing 
the radiation, are re-emitted in the infrared range. 



models dPier & Krolikl 119921: 


Granato & Danesd Il994h 


Efstathiou & Rowan-Robinson 


119951 iGranato et all 



1 https : //newton .pa.uky . edu/^clumpyweb/ 



11997ft have not been confirmed by the observations, 
the search for a more distributed or complex geome- 
try of the abso rbing material around the AGN have 
been promoted (INenkova et alj l2002t iFritz et alj 120061 : 
lElitzur fe Shlosmanll2006l : iBallantvrie et al.ll2006lh 

The cl umpy dusty torus models (INenkova et al.l|2002l . 
I2008allbl : iHonig et al.ll2006t ISchartmann et al.ll2008ft pro- 
pose that the dust is distributed in clumps, instead of 
homogeneously filling the torus volume. As an example 
of the success of these models, they permit to explain, for 
example, the observed mutations between Type-1 and 
Type-2 objects detec t ed in the spectra of a few AGN 
(jAretxaga et al.lll999l : iTrippe et al.ll2008l ). 

Since the reprocessed radiation from the torus is emit- 
ted in the infrared, this range is key to put constrains to 
the clumpy dusty torus models. High-resolution obser- 
vations at these wavele ngths are mandatory , due to the 
small size of the torus (jTristram et al.ll2007ft . This way, 
it is possible to separate the nuclear emission from that 
of the surrounding galaxy. Important observational con- 
straints for the torus models come then from the shape 
of the infrared spectral energy distributions (SEDs). Ac- 
curacy in the photometry a filter set spanning a broad 
wavelength range, and well-sampled SEDs are required 
to restrict the model parameters. 

The CL UMPY models that w e use in this work (de- 
scribed in INenkova et alJ l2008al fbl) consist of a clumpy 
distribution of matter with a radial extent characterized 
by Y — R /Rd, that is the ratio of the outer to the 
inner radii of the toroidal distribution. The inner ra- 
dius (i?d) is defined by the dust sublimation tempera- 
ture. Each clump is specified by its optical depth (ry), 
and all clumps are assumed to have the same optical 
depth. The dust extinction profil e corresponds to a stan - 
dard cold/oxygen-rich ISM dust ( Osscnk opf et alj [l992). 
These clumps, of a given dust composition, are heated 
by an AGN with a given spectral shape and luminos- 
ity. Thus, the inner radius (i?d) is determined uniquely 
by the luminosity and the chosen dust sublimation tem- 
perature. The number of clouds along a radial equatorial 
path is defined as N. The radial density profile is a power 
law (r~ q ), with an angular distribution characterized by 
a width parameter, a. 

3. BayesCLUMPY 

Every CLUMPY model requires several seconds to be 
calculated. For the typical lengths of converged Markov 
Chains, this would amount to something between 1 and 6 
days per Markov Chain for only one run of the inference 
problem. Obviously, this is something that is absolutely 
unacceptable if one wants to carry out inference for many 
AGN. 

The computational efforts carried out by the CLUMPY 
group has allowed them to calculate and distribute for 
public access around 2xl0 5 models (now increasing to 
more than 10 6 models). They have been calculated for 
a quite fine grid of model parameters. This database 
can be used to overcome the difficulty of evaluating the 
CLUMPY models by using it for interpolation. The rea- 
son for interpolation is that the MCMC code will propose 
models that are not present in the original grid, so that 
an efficient interpolation method has to be applied. If 
the interpolation method is fast enough, it will allows us 
to carry out systematic studies of the compatibility of 
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Fig. 1. — First 16 PCA eigenvectors obtained from the CLUMPY database. We have demonstrated that the full CLUMPY database 
can be decomposed using only the first 13 eigenvectors with great precision. The wavelength variation of the standard deviation of the 
difference between the original models and the truncated reconstruction is shown in Fig. [3] 



model parameters with different observations in a com- 
pletely Bayesian framework. Studies like analyzing the 
amount of information added by a given filter and which 
parameters can be confidently recovered from the data 
are possible under this framework. 

In the following sections we describe our approach for 
interpolation. It is based on the application of two dif- 
ferent machine learning techniques: principal component 
analysis (PCA) for dimensionality reduction and artifi- 
cial neural networks (ANN) for the i nterpolation . Such 
a method has been already applied bv lAuld et alj (2008) 
for approximate Bayesian inference of cosmological pa- 
rameter in a small hypercube around the commonly ac- 
cepted values of the cosmological pa rameters. A s imilar 
approach has also been employed by I Carroll et al.l (|2008l ) 
for the fast synthesis of Stokes profiles in magnetic atmo- 
spheres and the quick solution of Zeeman-Dopplcr imag- 
ing problems. 

3.1. Principal Component Analysis 

Each SED in the database is sampled at N\ = 124 
wavelength points. Clearly, some correlations exist be- 
tween different wavelength points, so that when the flux 
at a given wavelength is modified, the surrounding wave- 
length points are also modified in a very similar way (con- 
tinuity of the SED). As a consequence, the dimension of 
the non-linear manifold in which the SEDs "live" i s much 
smaller than 124 (see lAsensio R amos et a l. 2007b). This 
fact can be harnessed to apply dimensionality reduction 
techniques and efficiently compress the database. Al- 



though many complex technique exist, we apply here 
a very basic linear dimensionality reduction technique 
based on th e Principal Component Analysis (PCA; see 
lLoevejn~955() also known as Karhunen-Loeve transforma- 
tion. Briefly, the idea is to obtain a self-consistent basis 
(principal components) in which the data can be effi- 
ciently developed. This basis has the property that the 
largest amount of variance is explained with the least 
number of basis functions. It is useful to reduce the di- 
mensionality of data sets because most part of the vari- 
ability of the signal is carried by the first N' <C N\ 
eigenvectors. Note that, since PCA performs a linear 
analysis, it is not possible to reduce the dimensionality 
of the transformed manifold to the real dimensionality 
of the non-linear manifold and the number of necessary 
eigenvectors is larger than the number of physica l pa- 
rameters of the model (jAsensio Ramos et al.ll2007b| ) . See 
appendix IBl for more technical details on PCA. 

The first 16 PCA eigenvectors obtained from the 
CLUMPY database are shown in Fig. [TJ This figure 
shows that low-order eigenvectors are very smooth and 
take into account large-scale variations that are seen in 
the majority of the SEDs. On the contrary, high-order 
eigenvectors contain high-frequency details that produce 
small-scale details in a small amount of SEDs. Choosing 
only the first N' — 13 eigenvectors results in a very good 
representation of the whole database. In other words, 
this allows us to reduce the size of the database because 
we only need to give 13 numbers for each SED (the pro- 
jection of each SED along each PCA eigenvector) and 
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Fig. 2. — Topology of the artificial neural networks applied in this 
work. The neural networks contain an input layer of six neurons 
for the six fundamental parameters of the CLUMPY models. The 
output layer is composed of only one neuron that is associated 
with the projection of the SED along each PCA eigenvector. The 
intermediate layer (widely known as "hidden" layer) is used to 
obtain the non-linear mapping between the input and the output 
layers. 



their associated eigenvectors. Note that, for very large 
number of models > 10 5 , the reduction in size of the 
database tends to 13/N\, which is close to 1/10 in our 
case. 

Although the PCA eigenvectors can be usually 
associated to different physic al mechanisms (e.g., 
Skumanic h &: Lopez Ar istc 2002), our aim here is not to 
analyze them. We treat the PCA eigenvectors as a basis 
set of purely mathematical character that allows us to 
efficiently develop the database. Efficiency in our case 
means that the number of eigenvectors needed to repro- 
duce the database with a given error bar is the smallest 
possible. In any case, it is possible to see some well- 
known signatures like the 10 fiia one produced by dust 
emission/absorption in some eigenvectors of Fig. [T] 

3.2. Neural Network 

Although the PCA dimensionality reduction step has 
reduced the size of the database, it is still complex and 
time consuming to obtain the SED for values of the pa- 
rameters not present in the original grid. For this rea- 
son, we have developed an interpolation method based 
on a n artificial feed- forward neural network (ANN; see 
e.g., iNeall fl993T ). a widespread machine learning tech- 
nique that usually presents very good behavior. We 
have developed N' simple neural networks whose topol- 
ogy is shown in Fig. O The ANN consists of an input 
layer formed by 6 neurons that accept the physical pa- 
rameters of the CLUMPY models. The output layer is 
formed by one neuron whose value is the projection of the 
SED corresponding to the physical parameters of the in- 
put layer along each PCA eigenvector calculated before. 
Both layers are fully connected by an intermediate hid- 
den layer. The approximation properties of three-layered 
neural networks is somethin g known after the universa l 
approximation theorem (e.g., Cvbenko 1988; Ncal 1993). 
This theorem states that such a neural network, with a 
sufficiently large number of neurons in the hidden layer, 
can approximate any continuous function. We have ver- 
ified that, in order to get a compromise between the ap- 



proximation abilities of the neural network and the speed 
of evaluation, values of Nh between 30 and 50 give very 
good results. This method is not optimal because the 
values of Nh are set empirically. More refined meth- 
ods probably based on Bayesian model selection (or the 
ap proximate minimum d escription length method used 
by lAsensio Ram os 2006(1 can be used to infer the opti- 
mal number of hidden neurons based on objective mea- 
surements. However, for the purpose of our work, the 
employed method is enough to ensure good approxima- 
tion properties while maintaining a fast execution speed. 
The ANN uses the hyperbolic tangent activation func- 
tion. Prior to utilizing the neural network, all the input 
and output values are normalized to the interval [—1, 1] 
to improve the interpolation abilities of the network. 

The training of the ANN is done by modifying the 
weights until the minimizing the quadratic difference be- 
tween the output of the neural network and the correct 
values of the database (see appendixIClfor more details). 
It is important to note that over-fitting has been avoided 
using two data-sets chosen randomly from the database: 
one for training and one for validation purposes. The 
training process is stopped when the quadratic error de- 
creases for the training set but starts to increase for the 
validation set. We have verified that it is possible to 
carry out the training of the neural networks using only 
a subset of the full database, which greatly accelerates 
the process. This is a consequence of the smooth varia- 
tion of the SEDs with the physical parameters. Picking 
randomly from the database a training set with ~ 10% of 
the total number of models, the trained neural network 
does a very good job with the validation set and with the 
whole database. 

In order to analyze the ability of the PCA+ANN com- 
bination to reproduce the database, we show in Fig. [3]thc 
standard deviation (left panel) and the 99% percentile 
(right panel) of the distribution of differences in \F\/Fb \ 
between the exact SEDs of the full database and the re- 
constructed SEDs. The 99% percentile has been also 
represented in order to test the possibly poor general- 
ization properties of neural networks in regions close to 
the boundaries of the space of parameters. The dashed 
lines present the results when the PCA reconstruction is 
done using the original database. In such a case, the re- 
construction error is monotonically decreasing with the 
number of included PCA eigenvectors. If all the eigen- 
vectors are used, the reconstruction error turns out to 
be identically zero. With the first 13 eigenvectors, the 
reconstruction errors are below 5xl0~ 3 (lcr) and 10~ 2 
(99 % percentile) in all the wavelength range of interest. 
The solid lines are the reconstruction errors when the 
projections along the PCA eigenvectors are calculated 
by evaluating the artificial neural networks. Obviously, 
due to the approximate character of the neural networks' 
interpolation, the reconstruction errors are larger than in 
the exact case. The approximation abilities of the neu- 
ral networks worsen when the order of the eigenvector 
increases. The reason is that these eigenvectors contain 
high-frequency or less abundant signatures whose vari- 
ation with the parameters are less smooth. However, 
standard deviations of the reconstruction error are be- 
low 2xl0~ 2 in the whole spectral domain of interest, 
with errors going down to 10 -3 in some spectral win- 
dows. Concerning the 99% percentile, errors are always 
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Fig. 3. — Reconstruction errors characterized by the standard deviation (left panel) and 99% percentile (right panel) of the difference in 
^Fx/Pbol between all the SEDs of the original database and the reconstructed SEDs using an increasing number of PCA eigenvectors for 
the reconstruction. The quantity -Fbol i s the bolometric luminosity of the AGN. The dashed lines show the results obtained when the PCA 
coefficient of each SED is obtained using the exact SED for projecting along each eigenvector. The solid line corresponds to the results 
obtained when the ANN is used to obtain the projection along each PCA eigenvector. Note that the quality of the reconstruction depends 
on wavelength due to the existence of variable features in some spectral ranges. 
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Fig. 4. — Example of the interpolation obtained with the ANN approach. The black lines are the SEDs corresponding to randomly 
chosen physical parameters. The red lines are those reconstructed using the first 13 PCA eigenvectors using the projection of the exact 
SED on each PCA eigenvector. The blue lines are the results obtained using the ANN-based reconstruction. Note that the real SEDs are 
not seen because the rest of curves overlap. 



below 10 _1 , with wavelength regions close to 10~ 3 . 

An example of the ability of the ANN to approximate 
the database is shown in Fig. |4j where the black line 
is the exact SED obtained from the database (note that 
although only part of the database was used in the train- 
ing, these SEDs are obtained randomly from the full 
database). The red line is the SED reconstructed us- 
ing only the first 13 eigenvectors but using the correct 



SED for the projections along the eigenvectors. The blue 
line is the SED reconstructed using the neural networks. 
Note that differences are hardly noticeable and are well 
below any possible observational error or indeterminacy 
in the physical properties of the AGN. 

3.3. Advantages 

There are two main advantages of the approach fol- 
lowed in this paper. On the one hand, it opens the pos- 
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Table 1 

Centaurus A high spatial resolution nuclear 

DENSITY MEAS UREM ENTS FROM IME ISENHEIM ER ET AlI 

(120071 ) and IRadomski et al.I (120081 ). 



Filter Central wavelength (/im) Flux Density (mjy) 



NACO J 


1.28 


1.3±0.1 


NACO H 


1.67 


4.5±0.3 


NACO Ks 


2.15 


33.7±2.0 


NACO L' 


3.80 


200±40 


T-ReCS Si2 


8.74 


710±40 


T-ReCS Qa 


18.3 


2630±650 



sibility to interpolate in the database, so that it is now 
possible to calculate SEDs for combinations of parame- 
ters that were not present in the original grid. In princi- 
ple, if the SEDs depend smoothly on the parameters, the 
neural networks should have captured all the variability 
and there is no necessity to improve the griding of the 
original database. On the other hand, the synthesis of 
the SED is extremely simple and fast because all the de- 
tails of the calculations inherent to the CLUMPY model 
are approximated by the neural networks. One has to 
calculate the 13 projections of the SED onto the PCA 
eigenvectors by evaluating the 13 neural networks given 
by Eq. (|C3[) . Then, the reconstruction of the SED is ob- 
tained by adding the first 13 PCA eigenvectors weighted 
by these 13 projections. In terms of computational time, 
this makes it possible to synthesize ~ 10 4 SEDs in just a 
minute, so that we obtain a gain in time of a factor 10 3 or 
larger. The ANN+PCA approach can be essentially con- 
sidered as a huge compression of the database. Instead 
of saving the whole database, one only needs to save 
the weights of the neural networks and the PCA eigen- 
vectors. In our case, the complete database amounts 
to -430 Mb, while the ANN+PCA approach amounts 
to —40 kb, with a reduction factor that is close to 10 4 , 
with the added benefit of being able to easily interpolate. 
Of course, this improvement in the calculation speed is 
compensated by small differences as compared with the 
correct SEDs. 

3.4. Simulating filter photometry 

For the cases in which the observations are of filter 
photometry kind, once the SED is obtained with the pre- 
vious formalism, it remains to simulate the effect of the 
filters on the simulated SED. Given that 4>(X) is a fil- 
ter normalized to unit area used to obtain a point in 
the observations, the synthetic value is obtained by just 
evaluating: 

/oo 
F(A)0(A)dA, (1) 
- OO 

where A c is the central wavelength of the filter, obtained 
as: 

/oo 
\<f>{X)d\. (2) 
-oo 

Both integrals are approximated in BayesCLUMPY with 
a very simple trapezoidal quadrature: 

/(Ac) = ^WiFfAiMA,), (3) 

i 

with Wi the weights of the trapezoidal quadrature. 



4. ILLUSTRATIVE EXAMPLE 

We have implemented this forward modeling code into 
a Markov Chain Monte Carlo sampling algorithm that 
is used to evaluate the posterior distribution function 
for the physical parameters once an observation is pro- 
vided (see appendix [5] for details). The filters at which 
the information is available are fully configurable from 
a database of filters belonging to different instruments. 
One only needs to select the filter and give the the ob- 
served flux, di, and its corresponding error. For the mo- 
ment, Gaussian errors with standard deviation o~i and 
upper limits within a certain (user-selectable) confidence 
level are possible. 

The simplest version of the code uses uniform priors 
for all the parameters and admit to give the upper and 
lower limits for every parameter. It is also possible to 
choose a Gaussian prior in case any parameter is known 
a-priori to be around a given value with a certain disper- 
sion. The central position and the width of the Gaussian 
are the only adjustable hyper-parameters 2 in this case. 
More complex prior distributions are straightforward to 
include in the code and can be fully configurable. 

Apart from the 6-dimensional vector of parameters of 
the CLUMPY model given by (a, Y, JV, q, Ty, i), we add 
other parameters like a vertical shift in logarithmic scale 
that accounts for the normalization in luminosity of the 
SED or the amount of interstellar extinction character- 
ized by an extinction law and the absorption at vis- 
ible wavelengths. We neglect the presence of extinc- 
tion in this work, but the effect of the normalization 
in luminosity is treated as a nuisance parameter 3 and 
all the results are presented with this parameter inte- 
grated out (marginalized). Since the marginal posterior 
distributions of the vertical shift are already an output 
of BayesCLUMPY, one can carry out inference over the 
total luminosity of the AGN. 

4.1. Observations 

High resolution infrared data have been compiled 
from the literature for the Seyfert 2 galaxy Centaurus 
A (NGC 5128) in order to construct a purely-nuclear 
SED. Centaurus A is the closest active galaxy, with 
its core heavily obscured by a dust lane, and conse- 
que ntly only visible at wavelengths longwa rds of 0.8 
^m (jSchreier et alJl!998t iMarconi et al.ll2000D . 

Near-infrared data obtained with Naos-Conica 
(NACO) at UT 4 hav e been compiled from 
iMeisenheimer et al.I ()2007l ). Conica is a high spa- 
tial resolution near -infrared imager and spectrograph 
(jLenzen et al.I 119981 ). that works togeth er with Naos, 
the N asmyth Adaptive Optics System (Roussct et al. 
1998), providing adap tive-optics corrected obser vations 
in the range 1-5 /xm. IMeisenheimer et al.I (|2007| ) found 
the nucleus unresolved at all wavelengths with a FWHM 
of 0.10" in the J band, 0.088" in the H band, 0.059" 
in the Ks band, and 0.090"in the L' band. The fluxes 
corresponding to this unresolved component have been 
compiled and reported in Table [TJ 

2 The name hyper-parameters are usually applied to describe pa- 
rameters that modify the prior distribution. We treat these hyper- 
parameters as fixed and adjustable by the user and we do not carry 
out any estimation over them. 

3 Parameters in which the model depends on but are of no in- 
terest in the parameter estimation. 
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Fig. 5. — Marginal posterior distributions for all the free parameters of the CLUMPY model considered here. Note that Ty and N are 
nicely constrained. Other distributions only favor certain values of the space of parameters. The vertical shift of the SED, treated as a 
nuisance parameter, has been marginalized. 



High spatial resolution mid-infrared d ata of the nucleus 
of Cen taurus A are also compiled from iRadomski et al.l 
(2008). The observations were taken in the Si2 filter 
(8.8 ixm) and in the Qa band (18.3 fim) using the mid- 
infrared imager /spe ctrometer T-ReCS on Gemini South 
(|Telesco et alJ ll998). The core is also unresolved in this 
range, surrounded by a diffuse extended emission. In 
these bands, the upper limits to the size of the unresolved 
nucleus are 0.19" at 8.8 /jm and 0.21" at 18.3 /im at the 
FWHM level. 

With these high spatial resolution density measure- 
ments of the unresolved component of Centaurus A, we 
can construct a purely-nuclear SED of this galaxy, to be 
fitted with the CLUMPY models, as an example of the 
use of our tool. 

4.2. Bayesian Analysis 

The parameter estimation results presented in this sec- 
tion have been carried out with a Markov Chain of length 
6 x 10 5 , of which we take out the initial 40% as a burn-in 
(transitory initial portion of the Markov Chain in which 
the algori thm is not correctly sampl ing the distribution; 
see, e.g.. lNeallll993t iGregorvl I2005D . Our experiments 
have shown that the transitory phase is quite short and 
that the burn-in can be safely reduced to just the first 
~ 10% of the chain without much problem, thus increas- 
ing the quality of the sampling. Although a 40% burn- 
in is surely excessive, the fact that the MCMC scheme 
works very fast (less than half a minute for a 6 x 10 5 
chain), we prefer to run a longer chain and maintain the 
large burn-in. This way we make sure that the MCMC 
algorithm is well mixed and we are sampling correctly 
the posterior distribution. 

The one-dimensional marginalized posterior distribu- 
tions are shown in Fig. [5] for all the free model param- 
eters. These distributions are obtained as histograms 
of the Markov Chain for each parameter due to the au- 



tomatic marginalization properties of the MCMC tech- 
nique. This information encodes, for every parameter, 
the effect of ambiguities and degeneracies (the marginal- 
ization of the posterior takes into account all the possible 
values of the rest of parameters weighted by their proba- 
bilities) and summarizes the statistical properties of the 
estimation for each parameter. Uniform priors have been 
employed, leaving for a later publication the analysis 
of AGN in which a-priori knowledge might be present 
(Ramos Almeida et al., 2009, in preparation). There- 
fore, these posterior distributions have to be compared 
with uniform distributions giving equal probability to all 
combinations of parameters. When the observed data in- 
troduces enough information into the problem, the pos- 
terior distributions clearly differ from the uniform distri- 
bution. The uniform distributions are truncated to the 
following intervals: a = [15, 75], Y = [5, 30], N = [1, 15], 
q = [0,3], Ty = [10,60] and i = [0,90]. These values are 
based on physically reasonable assumptions that avoid 
non-realistic solutions. 

Figure [5] clearly indicates that all the parameters are 
nicely constrained for Centaurus A. The marginal distri- 
bution of ry, with its asymmetric shape, is showing us 
that the observations are able to put an upper limit to 
its value. The calculation of the confidence intervals in 
this histogram gives us that the upper limit to Ty is 11.2 
at 68% (lcr level) confidence and 13.0 at 95% confidence 
(2a level). Concerning N, there is a tail towards large 
values that produce slightly asymmetric error bars. If we 
summarize the histogram using the median, we find that 
N = 3.1±o j at 68% confidence and N = 3.1±%% at 95% 
confidence. It is also possible to use different quantities 
to summarize the statistical information, like the poste- 
rior mean or the posterior mode (maximum-a-posteriori 
or MAP), although the relevant information is provided 
by the full marginal posterior distribution. 

The marginal posterior for Y also indicates that it is 





a nicely constrained parameter, again with asymmet- 
ric error bars due to the enhanced tail. We obtain 
Y = 11.5±|;| at 68% confidence and Y = 11.5±| | at 
95% confidence. A different behavior is found for a, q 
and i. In the three cases, there is a region of the space of 
parameters that is favored with respect to others. For in- 
stance, the Bayesian analysis gives a larger probability to 
inclinations close to 90° (the MAP value is ~ 90°), prac- 
tically discarding angles close to 0°. The same happens 
for q, in which data favor values close to zero. Summa- 
rizing, we get i > 78°, a > 56° and q < 0.47 with 68% 
confidence. 

Correlations between the parameters can be under- 
stood at the light of the two-dimensional marginal dis- 
tributions of Fig. [6l where the contours mark the 68% 
and 95% confidence regions. Instead of plotting all pos- 
sible combinations of parameters, we only show six cases 
that are representative of the general behavior. It can 
be seen that the observed data discard large values of 
N, Y, Ty and q. The plot a — i presents weakly con- 
strained parameters, although it is clear from this figure 
that large values of i are favored (in accordance with the 
Type-2 classification of Cen A) . Small values of i are not 
preferred and become slightly more likely as a increases 
(i.e., as the probability that the central engine is blocked 
from view). In other words, the results show that Type-1 
orientations (small i simultaneous with small a) are not 
favored by the data. We present the physical interpreta- 
tion of these results in Ramos Almeida et al. (2009, in 
preparation). 

It turns out interesting that there is not much ambi- 
guity in the parameters so that the available data is able 
to constrain the large variability of the clumpy torus 
models. One should expect to obtain even more con- 
strained marginal posterior distributions when increasing 
the number of filters. 

Although the solution to the Bayesian inference prob- 



lem are the posterior distributions shown in Fig. [5j one 
can try to represent the models corresponding to the me- 
dian or the maximum-a-posteriori values of the parame- 
ters (within the confidence intervals) to visually compare 
them with the observations. The maximum-a-posteriori 
model is displayed in solid line in Fig. [7] This is obtained 
using the combinations of parameters that maximize the 
posterior distribution. The dashed line shows the model 
obtained using the medians of the marginal posterior dis- 
tribution for each parameter. Finally, the dashed line 
tries to give an idea of the range of variability of the 
compatible models. It is built by synthesizing SEDs for 
all combinations of parameters taking into account their 
confidence intervals around the median value. 

5. CONCLUSIONS 

This paper presents a computer code for the Bayesian 
analysis of nuclear SEDs of AGN using CLUMPY mod- 
els. This approach allows us to obtain the full solution 
to the inference problem in terms of posterior probability 
distributions of the model parameters. These probability 
distributions take into account the a-priori information 
about the parameters and the information introduced by 
the observations. Presently, the prior distribution can be 
selected to be uniform in an interval, Gaussian or Dirac 
delta. According to the observations, the code admits 
observations corrupted with Gaussian noise and/or up- 
per limit detections. 

The machine learning technique based on the combi- 
nation of the PCA decomposition and the application of 
artificial neural networks for the approximation of the 
database leads to a gain of a factor 10 4 in disk storage 
and 10 3 in execution time. As a sub-product, it pro- 
vides the possibility to interpolate in the database, thus 
it becomes feasible to generate SEDs for combinations of 
parameters not present in the original grid. This charac- 
teristic will be of great importance for the analysis of the 
response function of the SED in certain spectral ranges 
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Fig. 7. — High-resolution nuclear SED of Centaurus A, con- 
structed with near- and mid-infrared data shown in Table [l] The 
solid line presents the maximum-a-posteriori estimation of the SED 
while the dashed line is the one calculated with the median of the 
marginal posterior distributions of the parameters. The shaded 
region is the range of SEDs compatible within the 68% confidence 
interval for each parameter. 



of interest to the model parameters (derivatives of the 
SED with respect to the model parameters). 

Although the space of parameters is very large, the 
approximation properties of the method are very good, 
giving differences between the exact and the approximate 
SED with a standard deviation below 0.1 dex in the spec- 
tral window of interest. These errors are clearly below 
any uncertainty in the observations, so that the final re- 
sults will not be dominated by the approximation step. 
This good behavior is possible because of two reasons. 
First, the variation of the SEDs is notably smooth when 
the physical parameters are varied. This greatly facili- 
tates the interpolation properties of the artificial neural 
networks. Second, the precision needed in the approxi- 
mation is not specially significant and one only needs to 
train the neural networks until the differences between 
the exact SEDs and the reconstructed SEDs are below 
the observational errors. 

In order to demonstrate the output of the code, we 
have shown an example with Centaurus A. The anal- 
ysis of the marginal posterior distributions shows that 

4 http : / /www . ittvis . com/ idl 



some parameters are nicely determined by the data, while 
other parameters remain less constrained. Although one 
can summarize the marginal posteriors using modes, me- 
dians and/or means, together with confidence intervals, 
the true solution to the problem from the Bayesian point 
of view are the histograms shown in Fig. [5l 

Although BayesCLUMPY is focused towards the anal- 
ysis of nuclear SEDs of AGN using CLUMPY mod- 
els, the core of the method remains completely general 
and can be applied to a myriad of problems for which 
one is interested in fitting an already built database 
of models to observations. Cases like the database of 
synthetic spectra comput ed for the GAIA project (e.g., 
iBrott fc Hausc hildt 2005|) or the analysi s of the spectral 
energ y distributions of protostars (e.g.. iRobitaille et al.l 
120071 ) are examples of potential applications. Apart from 
the gain in speed in the analysis, one would be able to 
carry out a fully Bayesian analysis of the observations, 
thus opening the possibility of including prior informa- 
tion and/or carrying out marginalization of nuisance pa- 
rameters. Furthermore, Bayesian model selection tech- 
niques would facilitate the objective selection of several 
competing models for explaining the observables. 

BayesCLUMPY is coded in Fortran 90 with a graphi- 
cal front-end developed in IDL 4 . We offer BayesCLUMPY 
to the astrophysical community with the hope that it 
will help researchers to take advantage of the CLUMPY 
models for analyzing observed SEDs. To get a copy, it 
suffices with making an e-mail request to the authors of 
this paper. 



We thank Nancy Levenson, Moshe Elitzur, and Ana 
Perez Garcia for useful comments. Financial sup- 
port by the Spanish Ministry of Education and Sci- 
ence through projects AYA2007-63881 and AYA2007- 
67965-C03-01, the European Commission through the 
SOLAIRE network (MTRN-CT-2006-035484) and the 
Spanish MEC under the Consolider-Ingenio 2010 Pro- 
gram grant CS D2006-00070: First Science w ith the GTC 
(http://www.iac.es/consolider-ingenio-gtc/) is gratefully 
acknowledged. We also acknowledge "Los Piratas" for 
inspiring the development of this work. 



APPENDIX 

A. BAYESIAN INFERENCE 

A.l. Fundamental considerations 

BayesCLUMPY is a computer code that is focused towards analyzing observed SEDs using theoretical models and 
carrying out inference over the par ameters of the theoretica l models. It is built under the framework of the Bayesian 
approach for the inference (see e.g.. lNeal 1993: lGregorvl[2005l ). of which we briefly summarize the main ideas. Let M be 
a model that is proposed to explain an observed data set D. In our case, D is usually the set of points observed using 
filter photometry, but more information can be introduced if the observation is of spectroscopic type. Additionally, 
M is the CLUMPY model described by iNenkova et al.l (j2Q08al pbl) that we have briefly described in Sjl Let I be a set 
of sensible background a-priori knowledge about the problem (for instance, the filters used for the observations). In 
general, the model M is described by a set of equations or algorithms that depend on a vector of parameters, 9. These 
parameters usually have a physical meaning and our aim is to obtain information about these parameters from the 
observations. Inside the Bayesian framework for inference, our approach is the one of parameter estimation. In the 
case at hand, the vector of parameters contains the six free parameters of the CLUMPY model already described in 

m 
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In general, due to the presence of noise in the observations, any inversion procedure is not complete by just giving 
the values of the model parameters that better fit the observations. The full inference problem has to provide the 
posterior probability distribution function (PDF) p(9\D, I) that describe the probability that a given set of parameters 
9 is compatible with the observables D given the set of background a-priori assumptions /. As a consequence, 
statistically relevant information can be obtained from this PDF by marginalization (integration) of unimportant 
parameters. Specifically, the marginalization of all the parameters except the one in which we are interested will 
give the probability distribution function of the parameter taking into account all possible compatible values of the 
rest of parameters. Therefore, ambiguities and degeneracies in the parameters are translated into marginal posterior 
distributions with heavy tails. The cornerstone of the Bayesian approach to inference is the Bayes theorem, that relates 
the posterior distribution p(B\D, I) with any a-priori knowledge and the information introduced by the data: 



p(0\D,I) 



p(D\0,I)p(8\I) 
p(D\I) ' 



(Al) 



where p{D\I) is the so-called evidence, p{6\I) is denominated the prior distribution and p(D\6,I) is the so-called 
likelihood. 

The evidence, equal to the integral of the posterior distribution over the parameter space, plays no role in the context 
of parameter estimation because it is a constant that does not depend on the model parameters 6. However, it turns 
out to be crucial in the context of model selection. Strictly speaking, when one carries out parameter estimation in the 
Bayesian context, one should indicate the estimation of the parameters, the error bars and the value of the evidence. 
This way, other researchers can compare their results with already published ones. The main difficulty is that the 
evidence is very difficult to calculate and specifically designed algorithms are needed ([SMlling 200l IGregorvll2005h . 

The prior distribution, p(6\I), contains all relevant a-priori information about the parameters of the model. Usually, 
unless some information is available about the value of some paramete rs, it is common to use uninformative priors like 
bounded uniform distributions or Jeffreys' priors (e.g.. [Gre gory 2005|). In case our a-priori knowledge of a parameter 
is sufficient to better constrain the value of a parameter, the Bayesian approach can easily introduce the information 
into the inference process by appropriately setting the prior distribution p(9\I). Among the options, one can select 
exponential distributions if large values of the parameters are less probable, Gaussian distributions around a given 
value with a certain width if there is a region of the space of parameters with more probability, etc. 

Finally, the likelihood p(D\6,I) is a distribution that gives the probability that the observed data set has been 
obtained using the set of parameters 6. Assuming that the observables are represented by the vector d of length N 
and that the model M evaluates to the vector y of the same length contaminated by a noise component e, we can 
write: 

Hi = di + a, Mi. (A2) 

When the chosen model parameters exactly correspond to those of the observed data set, the distribution of differences 
Hi — di has to follow the distribution of the noise. Assuming that the noise is Gaussian distributed with a variance 
described by the vector <x 2 , the likelihood function is given by: 



A' 



p(D\d,I) oc Y[cxp 



(A3) 



although other distributions can be used. For instance, in cases where only the upper of lower limit is known with 
a given confidence, it is possible to use skewed likelihood functions that appropriately take this into account. For 
simplicity, we assume one-sided Gaussian likelihoods centered at zero in which o~i is adjusted so that the ratio of the 
integral of the likelihood between and di and its total area equals the confidence level of the observation. 

Summarizing, Eq. (|A1[) states that the probability that a model M becomes plausible after the data D has been 
taken into account (posterior) depends on how plausible the model was before presenting the data (prior) and how 
well the model fits the data (likelihood). 

A. 2. Technicalities: the Markov Chain Monte Carlo method 

In order to calculate the posterior probability distribution function for one parameter and give estimations and 
confidence intervals, we have to marginalize (integrate out) the rest of parameters from the full posterior distribution: 



p(0i 



d,i) = J &exd0 2 - ■ -Mi-xMi+i - ■ -&e N ^p{e\Dj). 



(A4) 



To th is end, BayesCLUMPY utilizes a Markov Chain Monte Carlo (MCMC: [Metropolis et al.lll953l:rNea]||1993l; [Gregory! 
2005) scheme based on the Metropolis algorithm. The output of the MCMC method is a chain of models whose 
probability distribution follows the posterior distribution function. The MCMC technique can be also considered as an 
integration method that returns marginal probability distribution for each parameter in the model. As a consequence, 
the converged final Markov Chain obtained for each parameter automatically gives, after making histograms, its 
marginal posterior distribution (integrating out the rest of model parameters). Technically, our MCMC method works 
by proposing models using a multivariate Gaussian proposal distribution and accepting or rejecting the proposed 
models based on a standard Metropolis acceptance rule. The proposal density distribution used in the first steps of the 
chain is a multivariate Gaussian with diagonal covariance matrix that is set to 10% of the allowed range of variation 
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of the parameters. After a configurable initial period, the proposal density is changed to a multivariate Gaussian 
with a covariance matrix that is estimated from the previous steps of the chain. In order to improve convergence, the 
covariance matrix is modified by a parameter that assures that the accept ance rate of proposed models is close to 25%, 
a value that is close to the theoretical optimal value for simple problems (|Gehnan et al.lll996f) . 

Although MCMC methods represent a huge step forward in the practical application of Bayesian methods to the 
inference, one of the most important drawbacks is the necessity to sample the whole posterior probability distribution, 
something that can be very time consuming. Typical MCMC runs require Markov Chains of the order of 10 4 -10 5 steps 
in order to end up with correctly converged chains. In case the evaluation of the forward model is not negligible, the 
total time can be quite large and the systematic analysis of different observations remains completely unpractical. 

B. PRINCIPAL COMPONENT ANALYSIS 

Let us define O as the A mo d c i s x N\ matrix containing the wavelength variation of all the theoretical SEDs of the 
database where the mean SED has been substracted. The principal components can be found as the eigenvectors of this 
matrix of observations, so that they can be obtained by just diagonalizing the matrix O. Since we have iV mo deis N\, 
this matrix is not square by definition and the dimension of the matrix can be so large that computational problems 
can arise. However, it can be demonstrated that the right singular vectors of the matrix O are equal to the singular 
vectors of the covariance matrix: 

X = OtO, (Bl) 

that can be calculating with the Singular Value Decomposition algorithm (SVD; see, e.g.. iPress et alj [1986) . The 
matrix X is the N\ x N\ covariance matrix and the superindex f represents the transposition operator. The same 
applies to the left singular vectors, which are also eigenvectors of the covariance matrix X' = O O^. The matrix X' 
has dimensions N mo deis x -^models and is typically much larger than the matrix X. However, both descriptions are 
completely equivalent. The i-th. principal components, 6j, fulfills: 

X6 4 = k t br, (B2) 

with hi its associated eigenvalue. Writing all the eigenvectors as column vectors in the unitary matrix B of dimensions 
N\ x N\ , since this represents a complete basis of the database, the observations can be written as a linear combination 
of them as follows: 

O = C B f , (B3) 

being C the iV mo d e is x N\ matrix of coefficients, whose Cij element represents the projection of the observation i on 
the eigenvector j: 

C = O B. (B4) 

The dimensionality reduction is carried out by using the matrix 6 that contains only the first N Ie d eigenvectors that 
have been retained as containing the majority of signatures, so that, we end up with the matrix of approximate SEDs: 



O = (O BJ B T . (B5) 

C. ARTIFICIAL NEURAL NETWORK 

The function that the ANN whose topology corresponds to that shown in Fig. [2] proposes depends on the six- 
dimensional vector of parameters 6 = (cr, Y, N, q, ry , i) and can be represented as: 

Cfcm = f m (0k), (CI) 

where f m (dk) are highly non- linear functions whose functional form is given explicitly below in Eq. (|C3[) . The 
subindex k labels all CLUMPY models while the subindex m < N' labels all the artificial neural networks built for 
approximatin g the projection of the SEDs on each PCA eigenvector. Using Eqs. (|C1[) and (|B5|) . the likelihood function 
given by Eq. (1A3|) can be written as: 



N 



p(D\d,I)K J]exp 



(C2) 



where bj is the i-th. wavelength points of the j-th PCA eigenvector. 
The output of each neural network can be expanded to read: 



N h 

j'=l 



.1=1 



w l Am)9 l k + Uj(m) 



(C3) 



where the weights Vj (m) , Wj (m) and the bias Uj (m) represent free parameters that are optimized during the training 
process. The symbol Nh stands for the number of neurons in the hidden layer and this essentially determines the 
number of weights and biases or, in other words, the complexity of the network. 
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The training of the ANN is done by modifying, for fixed to, the weights and biases until minimizing the the quadratic 
difference between the true value of Ckm and the values returned by the artificial neural network for all the models 
included in the training: 

Strain 

(=1 

where -/Vtrain is the number of models included in the training. It is important to point out that neural networks suffer 
from the well-known problem of over-fitting if the number of weights is too large. When a neural network is over-fitting 
the training set, it looses generality and starts to "mimick" the fine details of each sample (similar to what happens 
when a set of noisy points is fit ted with a very high-degree polynomial) . Although over- fitting can be overcome with 
the aid of Bayesian techniques (MacKay 1992b. a), we prefer to use the less refined method of having a validation set 
as explained in the main text. 
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